Bioinformatics A Practical Guide to Next Generation Sequencing Data Analysis (Hamid D. Ismail)

Mapping of Sequence Reads to the Reference Genomes ◾ 59

rotations then are sorted alphabetically to form a matrix called BWM (Burrows–Wheeler

Matrix) as shown in the second column of Figure 2.7. The last column of characters (in

red) in the BWM is the BWT of the sequence. As shown in Figure 2.8, the BWT of the

sequence S is “AGG$GGTTCTC”. When an aligner uses BWT, it transforms the entire

genome sequence into a BWT. However, computing a BWT using the above naïve approach

is computationally expensive with O(n2 log n) time complexity. Instead, we can use the

suffix array to compute the BWT of a reference genome in O(n) time complexity. This

strategy is utilized by the aligners that use BWT. Constructing a BWT from a suffix array

is straightforward and very simple. To get the ith character of the BWT of the string S,

whose characters are indexed as i=1, 2, …, n, from the suffix array (A), simply use the

following rule:

BWT(S)[i] = S[A[i] - 1] if A[i] > 1 else S[n]

Figure 2.8 shows the indexes i=1, 2, 3, …,11 (first row) of the sequence S (second row) and

suffix array index A (third row), as shown in Figure 2.7. Now to infer the BWT from A,

let us begin from BWT(S)[11]. By applying the rule, BWT(S)[11] = S[A[11] - 1] = S[10] = A.

Likewise, BWT(S)[10] = S[A[10] - 1] = S[9]= G; BWT(S)[6] = S[A[6] - 1]=S[5]=G; but BWT(S)

[1], A[1] is less than 1; hence, BWT(S)[1] = $. For the BWT(S)[i] corresponding to A=9, 5, 8,

4, 7, 3, 2, we will continue using BWT(S)[i] = S[A[i] - 1]. The BWT is shown in Figure 2.9.

The question that comes up is why we need to transform a sequence of a reference

genome into BWT? The BWT serves two purposes; first, BWT groups the characters of

the sequence so that a single character appears many times in a row because the column

is sorted alphabetically and that can be used for sequence compression, which reduces the

memory storage. In this sense, the BWT is compressible and reversible. The second pur-

pose is that BWT is a data structure that can be used for indexing the sequence of a refer-

ence genome to make finding a position of a read fast.

The property of the BWT is that we can reverse it to obtain the BWM by using the last-

to-first column mapping property or simply as LF mapping, where L is the last column

FIGURE 2.7 Cyclic rotations and sorted rotation.